Please reference the video lecture for an overview of Decision Trees and Random Forests. This is just a reference notebook for the lecture video's code.
#install.packages('rpart')
library(rpart)
We can then use the rpart() function to build a decision tree model:
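rpart(formula, data=, method=, control=) where
formula-the model formula, in the form outcome ~ predictor1 + predictor2 + ...
data-the data frame containing the variables
method-"class" for a classification tree or "anova" for a regression tree
control-optional parameters controlling tree growth, set with rpart.control()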
We'll use the kyphosis data frame, which ships with rpart and has 81 rows and 4 columns representing data on children who have had corrective spinal surgery. It has the following columns:
Kyphosis-a factor with levels absent and present, indicating whether a kyphosis (a type of deformation) was present after the operation.
Age-in months
Number-the number of vertebrae involved
Start-the number of the first (topmost) vertebra operated on.
Let's check out the structure:
str(kyphosis)
head(kyphosis)
tree <- rpart(Kyphosis ~ . , method='class', data= kyphosis)
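Once the tree is fit, a natural next step is to ask it for predictions. The snippet below is a minimal sketch, not part of the lecture code: it uses predict() with type='class' on the same rows the tree was trained on and tabulates the result. The names pred, conf_mat and accuracy are just illustrative, and the accuracy is optimistic because no data was held out.

```r
library(rpart)

# Fit the same classification tree as above
tree <- rpart(Kyphosis ~ ., method = 'class', data = kyphosis)

# Predicted class label for each row of the training data
pred <- predict(tree, newdata = kyphosis, type = 'class')

# Confusion matrix: rows are actual labels, columns are predicted labels
conf_mat <- table(actual = kyphosis$Kyphosis, predicted = pred)
print(conf_mat)

# Training accuracy (optimistic, since the same rows were used for fitting)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(accuracy)
```

For an honest estimate you would split the data into training and test sets first and pass the test rows as newdata.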
There are lots of functions you can use to examine your tree model (in the table, fit stands for the fitted model object, which we named tree):
| Function | Description |
| --- | --- |
| printcp(fit) | display cp table |
| plotcp(fit) | plot cross-validation results |
| rsq.rpart(fit) | plot approximate R-squared and relative error for different splits (2 plots); labels are only appropriate for the "anova" method |
| print(fit) | print results |
| summary(fit) | detailed results including surrogate splits |
| plot(fit) | plot decision tree |
| text(fit) | label the decision tree plot |
| post(fit, file=) | create postscript plot of decision tree |
Let's see a few of them:
printcp(tree)
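One common use of the cp table is pruning. The following is a rough sketch under a simple assumption: we pick the cp value with the lowest cross-validated error (the xerror column) and pass it to prune(). The selection rule and the names cp_table, best_row, best_cp and pruned are illustrative choices, not something the lecture prescribes, and the seed value is arbitrary.

```r
library(rpart)

set.seed(101)  # cross-validation folds are random; the seed value is arbitrary
tree <- rpart(Kyphosis ~ ., method = 'class', data = kyphosis)

# The cp table shown by printcp() is stored on the fitted object as a matrix
cp_table <- tree$cptable

# Pick the row with the lowest cross-validated error (xerror)
best_row <- which.min(cp_table[, 'xerror'])
best_cp <- cp_table[best_row, 'CP']

# Prune the tree back to that complexity
pruned <- prune(tree, cp = best_cp)
print(pruned)
```

On a data set this small the cross-validated errors are noisy, so the chosen cp can change from seed to seed.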
There are some built-in visualization capabilities from the table above, but they aren't very good looking:
plot(tree, uniform=TRUE, main="Main Title")
text(tree, use.n=TRUE, all=TRUE)
The rpart.plot package makes these visualizations much better.
#install.packages('rpart.plot')
library(rpart.plot)
prp(tree)
Random forests improve predictive accuracy by growing a large number of trees, each fit to a bootstrap sample of the observations and restricted to a random subset of the variables at each split. A case is classified by every tree in this new "forest", and the final predicted outcome combines the results across all of the trees (an average in regression, a majority vote in classification).
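To make the bootstrap-and-vote idea concrete, here is a hand-rolled sketch using rpart. Note that this is plain bagging, not a full random forest: it resamples rows but does not randomly restrict the variables tried at each split. The forest size n_trees, the seed, and the names votes and majority are all hypothetical choices for illustration.

```r
library(rpart)

set.seed(101)   # arbitrary seed so the bootstrap samples are reproducible
n_trees <- 25   # hypothetical forest size, kept small and odd to avoid tied votes
n <- nrow(kyphosis)

# Build a character matrix with one column of predicted labels per tree
votes <- sapply(seq_len(n_trees), function(i) {
  # Bootstrap sample: draw n row indices with replacement
  boot_rows <- sample(n, size = n, replace = TRUE)
  fit <- rpart(Kyphosis ~ ., method = 'class', data = kyphosis[boot_rows, ])
  # Each tree classifies every row of the original data
  as.character(predict(fit, newdata = kyphosis, type = 'class'))
})

# Majority vote across the trees for each row
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
print(table(actual = kyphosis$Kyphosis, voted = majority))
```

The randomForest package does all of this internally and adds the per-split variable sampling, which is why in practice you would call it directly.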
We can use the randomForest package to build a random forest:
# Random Forest prediction of Kyphosis data
library(randomForest)
model <- randomForest(Kyphosis ~ ., data=kyphosis)
print(model) # view results
importance(model) # importance of each predictor
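The fitted object also carries out-of-bag (OOB) results, which give an error estimate without a separate test set. The sketch below is an illustrative addition rather than lecture code; the seed is arbitrary and oob_pred is just a name chosen here.

```r
library(rpart)          # provides the kyphosis data frame
library(randomForest)

set.seed(101)  # arbitrary seed; the forest is built from random samples
model <- randomForest(Kyphosis ~ ., data = kyphosis)

# Out-of-bag confusion matrix, with a per-class error rate in the last column
print(model$confusion)

# With no newdata argument, predict() returns the out-of-bag predictions
oob_pred <- predict(model)
print(mean(oob_pred == kyphosis$Kyphosis))

# Dot chart of the variable importance measures
varImpPlot(model)
```

Because each tree only votes on rows it did not see during fitting, the OOB accuracy is usually a fairer number than accuracy measured on the training rows.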
You should be starting to feel comfortable with the syntax for training a model on data. The key is to understand the background of the algorithm and to know which package to install and load for it.